The Uppsala Corpus of Student Writings: Corpus Creation, Annotation, and Analysis
Abstract
The Uppsala Corpus of Student Writings consists of Swedish texts produced as part of a national test of students ranging in age from nine (in year three of primary school) to nineteen (the last year of upper secondary school) who are studying either Swedish or Swedish as a second language. National tests have been collected since 1996. The corpus currently consists of 2,500 texts containing over 1.5 million tokens. Parts of the texts have been annotated on several linguistic levels using existing state-of-the-art natural language processing tools. In order to make the corpus easy to interpret for scholars in the humanities, we chose the CoNLL format instead of an XML-based representation. Since spelling and grammatical errors are common in student writings, the texts are automatically corrected while keeping the original tokens in the corpus. Each token is annotated with part-of-speech and morphological features as well as syntactic structure. The main purpose of the corpus is to facilitate the systematic and quantitative empirical study of the writings of various student groups based on gender, geographic area, age, grade awarded or a combination of these, synchronically or diachronically. The intention is for this to be a monitor corpus, currently under development.
Similar resources
Annotating Errors in Student Texts: First Experiences and Experiments
We describe the creation of an annotation layer for word-based writing errors for a corpus of student writings. The texts are written in Swedish by students between 9 and 19 years old. Our main purpose is to identify errors regarding spelling, split compounds and merged words. In addition, we also identify simple word-based grammatical errors, including morphological errors and extra words. In ...
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
A Corpus-Based Contrastive Analysis of Stance Strategies in Native and Nonnative Speakers’ English Academic Writings: Introduction and Discussion Sections in Focus
The present study was an attempt to illustrate the interaction between writers and readers. Conveying of the writers’ voice, stance, and interaction with reader was put forward within this paradigm. Being a good academic writer is highly related to the use of these strategies. Adopting a position and persuading readers of claims are very important. This study was aimed at showing th...
Stripped of Authorship or Projected Identity? Iranian Scholars’ Presence in Research Articles
Research Article (RA) genre has been a significant area of research in academic writing over past decades. However, authors’ identity in RAs has not received much attention, especially in soft sciences like applied linguistics. This paper reports a corpus analysis of Iranian writers’ authorial presence markers in RAs in the field of applied linguistics. The corpus comprised 30 RAs (200,000 word...
Annotating an Arabic Learner Corpus for Error
This paper describes an ongoing project in which we are collecting a learner corpus of Arabic, developing a tagset for error annotation and performing Computer-aided Error Analysis (CEA) on the data. We adapted the French Interlanguage Database FRIDA tagset (Granger, 2003a) to the data. We chose FRIDA in order to follow a known standard and to see whether the changes needed to move from a Frenc...
Publication date: 2016